Computational Lab: 1-30-09

  1. Start with the files acghTable.dat and probeList.dat. Using the probeIDs as unique identifiers, match each probe's value to its position and transform the data for sample "0001" into lff format.
    • Take the scores from column "0001" and insert them into the 'score' field of your lff.
    • Filter out any probes with no value (or "NA").
    • Use "+" for each strand value.
    • The phase, qstart, and qstop fields should be filled in with dots (".").
    • Create an attribute-value pair in the 13th column called "sample" with the value "0001".
    • deliverables: the ruby script you use to create the lff file
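
    The conversion in step 1 could be sketched roughly as follows. This is only an illustration under assumptions: both files are taken to be tab-delimited, probeList.dat is assumed to hold probeID, chromosome, start, and stop in that order, and acghTable.dat is assumed to have a header row with the probe ID in the first column. The class, type, and subtype names ("aCGH", "acgh", "probe") and the "name=value;" attribute syntax are placeholders; check them against the real files and the lff documentation.

```ruby
# Sketch only: the column layouts below are assumptions, not the real file specs.
# Assumed probeList.dat rows: probeID <TAB> chrom <TAB> start <TAB> stop
# Assumed acghTable.dat: header "probeID <TAB> 0001 <TAB> ...", then one row per probe

# Build a probeID => [chrom, start, stop] lookup from the probe list lines.
def parse_positions(lines)
  positions = {}
  lines.each do |line|
    id, chrom, start, stop = line.chomp.split("\t")
    next unless start.to_s =~ /\A\d+\z/ # skips blank lines and any header row
    positions[id] = [chrom, start, stop]
  end
  positions
end

# Turn the score table into lff lines for one sample column.
def to_lff(table_lines, positions, sample)
  header = table_lines.first.chomp.split("\t")
  col = header.index(sample) or raise "no column named #{sample}"
  table_lines.drop(1).map do |line|
    fields = line.chomp.split("\t")
    id = fields[0]
    score = fields[col].to_s.strip
    next if score.empty? || score == "NA" # drop probes with no value
    next unless positions.key?(id)        # drop probes with no known position
    chrom, start, stop = positions[id]
    # class, name, type, subtype, chrom, start, stop, strand,
    # phase, score, qstart, qstop, attributes (13th column)
    ["aCGH", id, "acgh", "probe", chrom, start, stop, "+",
     ".", score, ".", ".", "sample=#{sample};"].join("\t")
  end.compact
end

# Made-up data for illustration; a real script would use File.readlines instead.
probe_lines = ["probeID\tchrom\tstart\tstop", "P1\tchr12\t100\t160", "P2\tchr12\t500\t560"]
table_lines = ["probeID\t0001\t0002", "P1\t0.42\t0.10", "P2\tNA\t0.30"]
puts to_lff(table_lines, parse_positions(probe_lines), "0001")
```

    Keeping the parsing in small methods that take arrays of lines makes it easy to try the logic on a few hand-written rows before running it on the full files.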
  2. Upload this data into Genboree (you may either use the API or upload it manually). Then, use the Segmentation tool (under Tools > Plugins) to select regions of high copy-number variance. Require that each segment contain at least 3 probes, and that the score of each segment exceed two standard deviations from the mean.
  3. Combine the resulting track with the segmented data from the other 185 tumors (all185.acgh.lff.gz) and select out only those segments that represent gains on chromosome 12, using the Annotation selector tool.
    • note: you don't have to unzip the file before uploading to Genboree
  4. Now, upload the file refSeq.blocked.noSplice.lff.gz to your database. It will create a track called "RefSeq:Blocked". This is the RefSeq genes track with intronic sequences treated as part of the gene and all of the splice variants removed. Use the Attribute Lifter tool in Genboree to lift in the sample names from chr 12 gains that hit these genes.
  5. Click on the track name and use the tabular view to create a table with two columns: the gene name and a comma-separated list of matching samples. Download this table, then write a small ruby script that parses it and outputs only the few genes that are altered in more than 20 samples.
    • note: the "numIntersects" field is not a reliable indicator of how many samples match, only how many distinct annotations match. You'll have to write a script to count the entries in the samples attribute.
    • deliverables: the ruby script that parses your tabular output.
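
    The counting in step 5 might look something like the sketch below. It assumes the downloaded table is tab-delimited with the gene name in the first column and the comma-separated sample list in the second; the method name recurrent_genes is made up for illustration. Because several annotations from one sample can hit the same gene, the sketch counts distinct sample names rather than list entries. That is an interpretation of the note above, so adjust it if you need raw entry counts.

```ruby
# Sketch only: assumes rows of "gene <TAB> sample1,sample2,..." (layout is an assumption).
# Returns {gene => number of distinct samples} for genes seen in MORE than min_samples samples.
def recurrent_genes(lines, min_samples = 20)
  counts = {}
  lines.each do |line|
    gene, samples = line.chomp.split("\t", 2)
    next if gene.nil? || samples.nil? # skip blank or malformed rows
    n = samples.split(",").map(&:strip).reject(&:empty?).uniq.size
    counts[gene] = n if n > min_samples
  end
  counts
end

# Made-up rows and a low threshold so the toy data produces output.
rows = ["GENE_A\t0001,0002,0002,0003", "GENE_B\t0001"]
recurrent_genes(rows, 2).each { |gene, n| puts "#{gene}\t#{n}" }
```

    A header row, if the download includes one, has a single entry in its second column and so falls below any realistic threshold without special handling.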
  6. Using this list of genes, along with tables chr12Table.acgh and exprTable.dat, calculate the Pearson's correlation between copy number and expression level for each of these genes. I'm giving you a small library (correlation.rb) that calculates the Pearson's correlation between two arrays of numbers. A usage example is given in 'example.rb'. In order for the require to work, make sure correlation.rb is in the same directory as your script (or is on Ruby's load path, $LOAD_PATH).
    • You may have values for some copy-number altered genes that don't appear on the expression array. Obviously, you can't calculate correlations for these - just skip them.
    • Your script should remove any samples with "NA" values from the calculation for each gene. Otherwise, these values will cause an error to be thrown.
    • deliverables: the ruby script you wrote to calculate correlation, and a file containing the following data columns: gene name, # of recurrences, correlation value
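
    The "NA" handling in step 6 could be sketched as below. It assumes you have already pulled out, for one gene, two arrays of values in the same sample order (one from chr12Table.acgh, one from exprTable.dat). The pearson method here is a hypothetical stand-in written for this sketch so that it runs on its own; it is not the interface of correlation.rb, so in your script use the call shown in example.rb instead.

```ruby
# Keep only the sample positions where both values are present and not "NA".
# Returns two float arrays of equal length.
def paired_values(copy_number, expression)
  pairs = copy_number.zip(expression).reject do |cn, ex|
    [cn, ex].any? { |v| v.nil? || v.to_s.strip.empty? || v.to_s.strip == "NA" }
  end
  [pairs.map { |cn, _| cn.to_f }, pairs.map { |_, ex| ex.to_f }]
end

# Stand-in Pearson correlation (hypothetical; replace with the correlation.rb call).
# Returns nil when the correlation is undefined (too few points or zero variance).
def pearson(xs, ys)
  n = xs.size
  return nil if n < 2
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  return nil if var_x.zero? || var_y.zero?
  cov / Math.sqrt(var_x * var_y)
end

# Made-up values for one gene; the third sample is dropped because of the "NA".
cn_values   = ["0.1", "0.5", "NA", "0.9"]
expr_values = ["2.0", "3.1", "4.0", "4.8"]
xs, ys = paired_values(cn_values, expr_values)
puts pearson(xs, ys)
```

    Filtering the two arrays together, rather than separately, keeps each copy-number value lined up with the expression value from the same sample.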

  7. All deliverables:

    This assignment will be due in two weeks, on Feb 13, 2009.

    Zip the files up, title the zip with your name, and send them to chrisamiller@gmail.com.

    Feel free to contact me if you're having any problems. Email is usually the best way, and I'll almost always respond within an hour or two. We can also arrange a meeting - email me and we'll work out the details.

    I'll look over early submissions and if there are major problems, I'll return them to you and give you a chance to resubmit. Assignments completed closer to the due date may not get this opportunity.